import pandas as pd
import plotly
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
pd.set_option('display.max_columns', 110)
pd.set_option('display.max_rows', 200)
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
listing = pd.read_csv('CSV/listings.csv')
reviews = pd.read_csv('CSV/reviews.csv')
listing.head()
# Converting price column to float
listing['price'] = listing['price'].astype(str)
def func_remove_character(x, character):
    """Returns string with the character removed

    Args:
        x (str): The string that will have the character removed
        character (str): character to be removed from x

    Returns:
        String with the character removed
    """
    return x.replace(character, '')
listing['price'] = listing.price.apply(lambda x: func_remove_character(str(x), '$'))
listing['price'] = listing.price.apply(lambda x: func_remove_character(str(x), ','))
listing['price'] = listing['price'].astype(float)
listing['price'].describe() # 75% of price values are below 600.
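As an aside, the multi-step cleanup above can be collapsed into a single vectorized call. A minimal sketch on hypothetical price strings (the sample values below are made up, not taken from listings.csv):

```python
import pandas as pd

# Hypothetical raw price strings, mimicking the '$1,234.00' format
raw = pd.Series(['$1,200.00', '$85.00', '$300.50'])

# Strip '$' and ',' in one vectorized regex pass, then cast to float
prices = pd.to_numeric(raw.str.replace(r'[$,]', '', regex=True))
print(prices.tolist())  # [1200.0, 85.0, 300.5]
```

This avoids the two separate `apply` passes and the intermediate `astype(str)` round-trips.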
sns.violinplot(x=listing.loc[listing['price'] < 1000, 'price']);
Some listings contain a price of zero, which is clearly a mistake. These values are dropped from the dataset, since there are few of them.
listing.loc[listing['price'] <= 0, 'price']
listing.drop(listing.loc[listing['price'] <= 0].index, axis = 0, inplace = True)
Since price is not well distributed, a binning technique will be applied in order to better analyze the data. The bin edges below were chosen to keep similar distributions between classes:
bins_price = [0, 100, 150, 200, 250, 300, 400, 600, 1000, 100000]
price_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
listing['price_bins'] = pd.cut(listing['price'], bins_price, labels = price_labels, include_lowest = True)
listing['price_bins'] = listing['price_bins'].astype(int)
listing['price_bins'].hist(bins = 30);
Analyzing different values for host_neighbourhood
listing['host_neighbourhood'].value_counts()
In order to analyze price for different neighbourhoods, the column price_bins was used instead of the price column: since it is well distributed between classes, the mean will be less sensitive to outliers.
listing.groupby('host_neighbourhood').agg({'price_bins': 'mean'}).sort_values(by = 'price_bins')
There are 163 different neighbourhoods, which makes an analysis based on the price_bins mean of every single neighbourhood difficult. There is also the problem that some neighbourhoods contain only one sample, which can lead to wrong intuitions about the location's average price (clustering some locations could be one solution, but it isn't the best approach).
One way to work around this problem was to make a visual analysis.
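Another lightweight mitigation, sketched below on a toy table (the neighbourhood names and bin values are hypothetical), is to drop neighbourhoods with too few listings before computing the mean, so that one-sample locations cannot distort the ranking:

```python
import pandas as pd

# Toy stand-in for the listings table
df = pd.DataFrame({
    'host_neighbourhood': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C'],
    'price_bins':         [1,   2,   3,   8,   4,   5,   4,   5],
})

# Keep only neighbourhoods with at least `min_samples` listings
min_samples = 3
counts = df['host_neighbourhood'].value_counts()
keep = counts[counts >= min_samples].index

avg = (df[df['host_neighbourhood'].isin(keep)]
       .groupby('host_neighbourhood')['price_bins']
       .mean()
       .sort_values())
print(avg)  # neighbourhood 'B' (a single sample) is excluded
```

The `min_samples` threshold is an arbitrary choice here; on the real data it would need tuning against the neighbourhood size distribution.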
# In order to use the function below, you must have a public token for the Mapbox API. Check more on: https://www.mapbox.com
px.set_mapbox_access_token('pk.eyJ1IjoiZGFuaWVsZGFjb3N0YSIsImEiOiJjazZzMGZ0c3gwYncwM2tzNW51d3B2ajUyIn0.U6j8vTW4kIJal4aBWEyDtQ')
fig = px.scatter_mapbox(listing, lat='latitude', lon='longitude', color='price_bins', size_max=20, zoom=9)
fig.show()
# Screenshot of the map above, for visualization on GitHub:
from IPython.display import Image
Image(filename='Images/price_location.PNG')
From the image above it's possible to observe that the price tends to be higher for Airbnbs close to the beach. For those more familiar with Rio de Janeiro's geography: we can also observe that the price is higher, on average, in the most famous neighbourhoods: Leblon, Ipanema, Lagoa and Barra da Tijuca.
Checking for null values.
listing['host_response_rate'].isnull().sum()
There are 11907 null values in the column host_response_rate. There is no reliable way to fill such a large gap, so the analysis will be done with the rest of the dataset. With 16434 remaining samples, this is not a perfect analysis, but not a poor one either.
# Creating a new dataset with the non null values of host_response_rate and review_scores_rating
response_and_score = listing.loc[(~listing.host_response_rate.isna()) & (~listing.review_scores_rating.isna())]
# Removing character '%' from host_response_rate
response_and_score['host_response_rate'] = response_and_score.host_response_rate.apply(lambda x: func_remove_character(str(x), '%'))
response_and_score['host_response_rate'] = response_and_score['host_response_rate'].astype(int)
response_and_score.review_scores_rating.hist(bins = 30);
As observed above, review_scores_rating has a very imbalanced distribution, which can lead to a misleading analysis. Just as was done for the price column, a binning technique will also be applied to this column.
score_bins = [0, 96, 100]
score_labels = [0, 1]
response_and_score['review_scores_rating_bins'] =pd.cut(response_and_score['review_scores_rating'], score_bins, labels = score_labels, include_lowest = True)
response_and_score['review_scores_rating_bins'] = response_and_score['review_scores_rating_bins'].astype(int)
response_and_score['review_scores_rating_bins'].hist();
response_and_score['host_response_rate'].hist(bins = 30);
The same technique will be applied to host_response_rate. In this case, binning is applied in order to facilitate the analysis:
response_rate_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90,100]
response_rate_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
response_and_score['host_response_rate_bins'] = pd.cut(response_and_score['host_response_rate'], response_rate_bins, labels = response_rate_labels, include_lowest = True)
response_and_score.groupby('host_response_rate_bins').agg({'review_scores_rating_bins':'mean'}).reset_index()
The result shows that the review scores rating isn't affected by the host response rate. This goes against the premise that these two variables would be directly proportional. The conclusion isn't absurd: not everyone cares about the host response rate, and there are more important factors that influence the review score: location, price, hospitality, home cleanliness, etc.
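This "no effect" reading could be backed up with a single correlation number. A minimal sketch on synthetic stand-ins for the two columns (generated independently here, so the true correlation is zero; on the real data the columns come from response_and_score):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic, independently generated stand-ins for the two columns
response_rate = pd.Series(rng.integers(0, 101, size=500))
score = pd.Series(rng.integers(80, 101, size=500))

# Spearman correlation is rank-based, so the skewed distributions seen
# above distort it less than they would a Pearson correlation
rho = response_rate.corr(score, method='spearman')
print(rho)  # close to zero for independent columns
```

A |rho| near zero would quantify the lack of association that the binned group means suggest.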
Based on the available datasets, reviews.csv is the one with the best features to retrieve this information. The dataset contains the date of each review, so this analysis will be based on the number of reviews per day. The assumption made here is that most people leave a review after a stay.
After a quick search I found that approximately 60%-70% of guests leave a review.
# Checking for null values
reviews.isnull().sum()
Preprocessing the date column and grouping reviews by month. A monthly window was chosen for the analysis.
# Truncate dates to year-month and convert to datetime format
func_datetime = lambda x: str(x)[:7]
reviews['date'] = reviews.date.apply(func_datetime)
reviews['date'] = pd.to_datetime(reviews['date'])
reviews['month'] = reviews['date'].dt.month
reviews['year'] = reviews['date'].dt.year
reviews = reviews.sort_values(by = 'date').reset_index(drop = True)
reviews_time_series = reviews.groupby(['date', 'month', 'year']).agg({'listing_id': 'count'}).reset_index()
reviews_time_series.rename(columns = {'listing_id': 'total_reviews'}, inplace = True)
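The string-slicing trick above works, but pandas also offers a period-based idiom for truncating dates to month starts. A sketch on hypothetical dates (not from reviews.csv):

```python
import pandas as pd

# Hypothetical review dates
dates = pd.Series(['2019-03-15', '2019-03-28', '2019-04-02'])

# Parse the full date first, then truncate to the first day of the month
parsed = pd.to_datetime(dates)
month_start = parsed.dt.to_period('M').dt.to_timestamp()
print(month_start.tolist())
```

This keeps the intermediate values as real datetimes, avoiding the string round-trip.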
# Checking that all months are present in every year
reviews_time_series.groupby('year').month.count()
reviews_time_series.set_index('date', inplace = True)
reviews_time_series.head()
Plotting the time series: total_reviews vs. date (monthly).
reviews_time_series.plot(y = 'total_reviews', linewidth = 5, fontsize = 20, figsize = (20,10));
plt.ylabel('Number of reviews', fontsize = 20);
plt.xlabel('Year', fontsize = 20);
From the time series above it's possible to observe a trend. In the following graph the trend was extracted from the time series by taking a rolling average: for each time point, the mean of the points in a window around it is taken. A window size of 12 months was used, so that yearly seasonal effects are averaged out and the trend remains.
# Checking for the existence of a trend
reviews_time_series.total_reviews.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
For analyzing seasonality, we must remove the trend. A differencing method was used, as presented below. Nevertheless, the resulting graph isn't very revealing: the seasonality of the time series is not clear from it.
# Checking for the existence of seasonality
reviews_time_series.total_reviews.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
From the box plot below it's possible to observe that January has a higher mean than the other months, which can be explained by the fact that it is summer in Rio de Janeiro.
sns.boxplot(x = 'month', y = 'total_reviews', data = reviews_time_series);
Based on a yearly seasonality what is expected to analyze from an autocorrelation plot is a spike at 12 months, meaning that the time series is correlated with itself shifted by twelve months.
From the autocorrelation plot we observe that the time series is correlated (at a 95% confidence interval) with its past 24 months. This result isn't very clear, and further study is required for better conclusions.
pd.plotting.autocorrelation_plot(reviews_time_series['total_reviews']);
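One direction for that further study is a classical trend/seasonal decomposition, which the rolling-average and box-plot steps above already approximate. A sketch on a synthetic monthly series (a stand-in for reviews_time_series['total_reviews'], built with an artificial yearly cycle so the expected result is known):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a yearly sine cycle
idx = pd.date_range('2015-01-01', periods=60, freq='MS')
t = np.arange(60)
y = pd.Series(100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12), index=idx)

# Classical decomposition by hand:
# trend   = centered 12-month rolling mean
# seasonal = average detrended value per calendar month
trend = y.rolling(12, center=True).mean()
detrended = y - trend
seasonal = detrended.groupby(detrended.index.month).mean()
print(seasonal)  # peaks in April, bottoms out in October for this series
```

On the real data, the month with the largest seasonal component should match the January effect seen in the box plot; libraries such as statsmodels offer ready-made versions of this decomposition.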
listing['accommodates'].describe()
The higher the number of accommodates, the fewer samples there are.
listing['accommodates'].hist(bins = 30);
Values bigger than 16 are grouped together:
listing.loc[listing['accommodates'] >= 16,'accommodates'] = 16
listing['accommodates'].hist(bins = 30);
As suspected, the variables are directly proportional.
listing.groupby('accommodates').agg({'price_bins': 'mean'})